[Common] Persistent Grouped MXFP8 quantization kernel #2738

Oleg-Goncharov wants to merge 58 commits into NVIDIA:main
Conversation
Greptile Summary

This PR introduces a persistent grouped MXFP8 quantization kernel with a static grid-stride scheduler. The kernel replaces the previous one-CTA-per-block dispatch with a persistent grid.

Confidence Score: 3/5
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nvte_group_quantize_v2] --> B[group_quantize_fwd_helper\nwith QuantizationConfig]
    B --> C{Scaling Type}
    C -->|MXFP8_1D_SCALING| D[mxfp8::group_quantize]
    D --> E[Validate fast_math\nconstraints]
    E --> F{is_single_tensor?}
    F -->|No| G[update_tma_descriptors\n1 thread per tensor block\nSkips empty rows/cols tensors]
    F -->|Yes| H[Use static TMA maps]
    G --> I
    H --> I[Compute work_blocks_X/Y]
    I --> J{PERSISTENT mode?}
    J -->|Yes| K[grid = sm_count × 24 × 1]
    J -->|No| L[grid = work_blocks_X × work_blocks_Y]
    K --> M[group_quantize_mxfp8_kernel\nSHAPE_REP compile-time template\nUSE_FAST_MATH compile-time template]
    L --> M
    M --> N[Init barriers\nBUFFS_NUM=2]
    N --> O{job_finished?}
    O -->|No| P[decode_job → JobDescriptor\nis_job_valid check]
    P --> Q{job_has_work?}
    Q -->|No: empty tensor| R[advance_to_next_job\ncontinue]
    R --> O
    Q -->|Yes| S[decode_block → BlockDescriptor\nfence_acquire_tensormap if new tensor]
    S --> T[Prime pipeline:\nprefetch stage 0]
    T --> U[Process STAGES=4 slices\ncolwise + rowwise per stage\nDouble-buffered TMA]
    U --> V{IS_DBIAS?\nis_single_tensor?}
    V -->|Yes| W[Write partial dbias\nto workspace]
    V -->|No| X
    W --> X[advance_to_next_job]
    X --> O
    O -->|Done| Y[atomicMaxFloat amax_ptr\ndestroy_barriers]
    Y --> Z{IS_DBIAS?}
    Z -->|Yes| AA[grouped_reduce_dbias]
    Z -->|No| AB[Done]
    AA --> AB
```
Last reviewed commit: "Merge branch 'main' ..."

Force-pushed from 924ff91 to 325181b
```cpp
const float *const thread_in_base = dbias_partial + dbias_in_offset_Y * cols + thread_id * nvec;
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
```
Output stride assumes uniform cols across all tensors

The output write offset is computed as:

```cpp
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
```

where `cols` is `last_logical_dim`, a single value shared across all tensors in the group. This is correct for SAME_BOTH_DIMS and VARYING_FIRST_DIM (where all tensors share the same last dimension), but the kernel receives `shape_rep` as a parameter and does not enforce that restriction.

For VARYING_LAST_DIM or VARYING_BOTH_DIMS, where per-tensor cols differ, the fixed `tensor_id * cols` stride would compute wrong output offsets. Currently, tests skip dbias validation for these cases, but the kernel would produce incorrect results if actually called with varying-last-dim tensors.
Consider adding a device-side assertion to enforce the precondition:

```cpp
if (shape_rep != ShapeRepresentation::SAME_BOTH_DIMS &&
    shape_rep != ShapeRepresentation::VARYING_FIRST_DIM) {
  NVTE_DEVICE_ERROR("group_reduce_dbias_kernel requires uniform last dimensions across tensors");
}
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
```
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Force-pushed from 5815335 to aa484a3
```cpp
const size_t scales_offset_Y_colwise = scales_block_offset_Y_colwise + tid_Y_colwise;
const size_t scales_offset_X_colwise = scales_block_offset_X_colwise + tid_X_colwise;
// ...
const bool rowwise_scale_is_within_bounds = scales_offset_X_rowwise < cols;
```
Wrong units in rowwise_scale_is_within_bounds guard
scales_offset_X_rowwise is a scale index (one entry per 32-element column group), while cols is the number of data columns. Comparing them directly means the guard almost never fires.
Concretely, with `cols = 96` and `SCALE_DIM_X = 32`:

- `scales_offset_X_rowwise` for the four threads of the last (and only) X-block is `{0, 1, 2, 3}`
- Valid scale positions covering real data: `{0, 1, 2}` (covering columns 0–31, 32–63, 64–95)
- The current check `3 < 96` evaluates to `true`, so thread 3 still writes a spurious scale for the nonexistent columns 96–127
The correct comparison multiplies the scale index back to column units:

```cpp
const bool rowwise_scale_is_within_bounds = scales_offset_X_rowwise * SCALE_DIM_X < cols;
```
This correctly excludes scale index 3 because 3 * 32 = 96, which is not < 96.
```cpp
__device__ __forceinline__ JobDescriptor decode_job(
    const ShapeRepresentation shape_rep, const bool is_single_tensor, const size_t num_tensors,
    const size_t first_logical_dim, const size_t last_logical_dim, const size_t work_blocks_X,
    const int32_t ctaid_X, const int32_t ctaid_Y, const int64_t *const __restrict__ offsets_ptr,
    const int64_t *const __restrict__ first_dims_ptr,
    const int64_t *const __restrict__ last_dims_ptr) {
  JobDescriptor job{};
  job.block_id = ctaid_Y * work_blocks_X + ctaid_X;
  job.block_global_offset = is_single_tensor
                                ? (ctaid_Y * CHUNK_DIM_Y * last_logical_dim + ctaid_X * CHUNK_DIM_X)
                                : (job.block_id * ELTS_PER_CHUNK);
  job.tensor_id = get_current_tensor_id(shape_rep, num_tensors, job.block_global_offset, ctaid_Y,
                                        first_logical_dim, last_logical_dim, offsets_ptr);
  job.rows =
      get_tensor_rows_num(job.tensor_id, shape_rep, first_logical_dim, first_dims_ptr, num_tensors);
  job.cols = get_tensor_cols_num(job.tensor_id, shape_rep, last_logical_dim, last_dims_ptr);
  return job;
}
```
This should be a constructor of the JobDescriptor struct (you can make the constructor __device__ too).
```cpp
__device__ __forceinline__ BlockDescriptor
decode_block(const JobDescriptor &job, const bool is_single_tensor,
             const int64_t *const __restrict__ offsets_ptr) {
  BlockDescriptor block{};
  block.tensor_base = is_single_tensor ? 0 : static_cast<size_t>(offsets_ptr[job.tensor_id]);
  const size_t CHUNK_DIM_X_ = CHUNK_DIM_X;
  const size_t blocks_X_num_in_current_tensor = DIVUP(job.cols, CHUNK_DIM_X_);
  block.block_id_in_current_tensor =
      is_single_tensor ? job.block_id : (job.block_id - block.tensor_base / ELTS_PER_CHUNK);
  block.block_id_Y = block.block_id_in_current_tensor / blocks_X_num_in_current_tensor;
  block.block_id_X = block.block_id_in_current_tensor % blocks_X_num_in_current_tensor;
  block.block_offset_Y = block.block_id_Y * CHUNK_DIM_Y;
  block.block_offset_X = block.block_id_X * CHUNK_DIM_X;
  return block;
}
```
Similarly this should be a constructor too.
```cpp
    const size_t global_offset_Y, const size_t buff_offset, const size_t shmem_buff_size,
    uint64_t *barrier, const bool leading_thread) {
  if (leading_thread) {
    ptx::mbarrier_arrive_expect_tx(barrier, shmem_buff_size);
```
Two questions: why is this done before the TMA call, and why is it done only by the leading_thread? In the other parts of the code (e.g. in ptx::copy_2d_to_shared) we do the transfer, then arrive_expect on the leading thread and just arrive on all the other threads.
ptx::mbarrier_arrive_expect_tx is also called by a single thread in ptx::copy_2d_to_shared. I initialized the barriers using a single thread, which is sufficient for them to work. We could also keep the previous approach, where all threads in the block participate explicitly. And since the async copy and expect_tx are in the same phase, it is also valid to issue expect_tx first.
```cpp
if (launch_block_id >= total_work_blocks) {
  return;
}
```
For example, this happens for small input tensors where total_work_blocks is less than SMs × K, with K = STATIC_PERSISTENT_BLOCKS_PER_SM.
```cpp
                              last_logical_dim, work_blocks_X, ctaid_X, ctaid_Y, offsets_ptr,
                              first_dims_ptr, last_dims_ptr);
    allow_next_job_prefetch =
        is_job_valid(prefetch_job, shape_rep, total_work_blocks, offsets_ptr);
```
If we are prevalidating the next job here, then why do we need the earlier check that the job we are about to do is valid, and the draining if it is not?
We prefetch the first stage of the next CTA's chunk at the end of processing the current one. This check only avoids copying data for null blocks. The main termination check, i.e., when to stop processing the current chunk and exit the loop, is at line 770.
```cpp
    is_job_valid(current_job, shape_rep, total_work_blocks, offsets_ptr);
if (!current_job_is_valid) {
  if (has_prefetched_current_job) {
    // A stage-0 prefetch may already be in flight for this CTA. Drain it before exiting.
```
We destroy the barriers after exiting the loop. But this invalidation can only be done once the mbarrier objects are guaranteed to have completed their current phase (drained). Otherwise, the TMA engine may finish the copy and attempt to signal completion on an already-invalidated mbarrier.
Force-pushed from 7c41a6a to 6874935
Force-pushed from 7bdc696 to 5068556
```cpp
{VARYING_LAST_DIM, 3, 256,896, 128,256,512},
{VARYING_BOTH_DIMS, 2, 1,(128*128)+(256*256), 128,256, 128,256},
// Empty tensor in the middle of the group must not terminate the persistent work loop.
{VARYING_BOTH_DIMS, 3, 1,(128*128)+(128*128), 128,0,128, 128,0,128},
```
Can you also add a test case for only the varying-first-dim case that contains a group of size zero?
/te-ci
Description
This PR adds a persistent grouped MXFP8 quantization kernel with static scheduling.